Music genre classification (MGC) has attracted growing attention from the community owing to the increased demand for music streaming and recommendation services and recent developments in music information retrieval frameworks. However, convolution-based approaches are known to lack the ability to efficiently encode and localize temporal features. In this paper, we study broadcast-based neural networks, aiming to improve localization and generalizability under a small parameter budget (about 180k), and investigate twelve variants of broadcast networks, discussing the effects of block configuration, pooling method, activation function, normalization mechanism, label smoothing, channel interdependency, inclusion of LSTM blocks, and inception schemes. Our computational experiments on relevant datasets such as GTZAN, Extended Ballroom, Homburg, and the Free Music Archive (FMA) show state-of-the-art classification accuracy for music genre classification. Our approach offers insights into, and the potential to enable, compact and generalizable broadcast networks for music and audio classification.
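As a concrete illustration of the general idea, the following is a minimal PyTorch sketch of one possible broadcast-style block: the 2D feature map is averaged over the frequency axis, processed with a temporal depthwise convolution, and broadcast back over frequency. The class name `BroadcastBlock`, the kernel sizes, the normalization, and the activation are assumptions for illustration only, not the exact block configurations studied in the paper.

```python
import torch
import torch.nn as nn

class BroadcastBlock(nn.Module):
    """Hypothetical broadcast-style residual block (illustrative sketch).

    Input shape: (batch, channels, freq, time). The frequency-averaged branch
    is processed in 1D over time and broadcast back over the frequency axis.
    """

    def __init__(self, channels: int, temporal_kernel: int = 3):
        super().__init__()
        # 2D branch: frequency-aware depthwise convolution
        self.freq_dw = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                 padding=(1, 0), groups=channels)
        self.freq_norm = nn.BatchNorm2d(channels)
        # 1D branch: temporal depthwise + pointwise convolution on the
        # frequency-averaged feature map
        self.time_dw = nn.Conv1d(channels, channels, kernel_size=temporal_kernel,
                                 padding=temporal_kernel // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, x):                       # x: (B, C, F, T)
        f = self.freq_norm(self.freq_dw(x))     # frequency-aware 2D features
        t = f.mean(dim=2)                       # average over frequency -> (B, C, T)
        t = self.pointwise(self.act(self.time_dw(t)))
        # broadcast the temporal features back over the frequency axis
        return x + f + t.unsqueeze(2)
```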
This paper proposes a novel observer-based controller for a Vertical Take-Off and Landing (VTOL) Unmanned Aerial Vehicle (UAV) designed to directly receive measurements from a Vision-Aided Inertial Navigation System (VA-INS) and produce the required thrust and rotational torque inputs. The VA-INS is composed of a vision unit (monocular or stereo camera) and a typical low-cost 6-axis Inertial Measurement Unit (IMU) equipped with an accelerometer and a gyroscope. A major benefit of this approach is its applicability in environments where the Global Positioning System (GPS) is inaccessible. The proposed VTOL-UAV observer utilizes IMU and feature measurements to accurately estimate attitude (orientation), gyroscope bias, position, and linear velocity. The ability to use VA-INS measurements directly makes the proposed observer design more computationally efficient, as it obviates the need for attitude and position reconstruction. Once the motion components are estimated, the observer-based controller is used to control the VTOL-UAV attitude, angular velocity, position, and linear velocity, guiding the vehicle along the desired trajectory in six degrees of freedom (6 DoF). The closed-loop estimation and control errors of the observer-based controller are proven to be exponentially stable starting from almost any initial condition. To achieve a global and unique VTOL-UAV representation in 6 DoF, the proposed approach is posed on the Lie group, and a unit-quaternion design is also presented. Although the proposed approach is described in continuous form, the discrete version is provided and tested. Keywords: Vision-aided inertial navigation system, unmanned aerial vehicle, vertical take-off and landing, stochastic, noise, robotics, control systems, air mobility, observer-based controller algorithm, landmark measurement, exponential stability.
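For readers unfamiliar with the setting, the following is a minimal sketch of the standard rigid-body kinematics and IMU/landmark measurement models that this family of observers is typically built on. This is textbook background stated under common assumptions, not the paper's specific observer or controller equations.

```latex
\begin{align}
\dot{Q} &= \tfrac{1}{2}\, Q \otimes \begin{bmatrix} 0 \\ \omega \end{bmatrix},
  \qquad \dot{p} = v, \qquad \dot{v} = g + R(Q)\, a, \\
\omega_m &= \omega + b_\omega + n_\omega, \qquad a_m = a + n_a, \\
y_i &= R(Q)^{\top} \left( p_i - p \right) + n_i ,
\end{align}
```

where $Q$ is the unit quaternion with $R(Q)$ the body-to-inertial rotation, $p$ and $v$ the position and linear velocity in the inertial frame, $\omega$ and $a$ the angular velocity and specific force in the body frame, $g$ gravity, $b_\omega$ the gyroscope bias, $n_{(\cdot)}$ noise terms, and $y_i$ the body-frame measurement of the $i$-th landmark located at $p_i$.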
Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within the transformer models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies, compared to the local convolutional-based design. However, the self-attention operation has quadratic complexity, which proves to be a computational bottleneck, especially in volumetric medical imaging, where the inputs are 3D with numerous slices. In this paper, we propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters and compute cost. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features using a pair of inter-dependent branches based on spatial and channel attention. Our spatial attention formulation is efficient, having linear complexity with respect to the input sequence length. To enable communication between spatial and channel-focused branches, we share the weights of the query and key mapping functions, which provides a complementary benefit (paired attention) while also reducing the overall network parameters. Our extensive evaluations on three benchmarks, Synapse, BTCV and ACDC, reveal the effectiveness of the proposed contributions in terms of both efficiency and accuracy. On the Synapse dataset, our UNETR++ sets a new state-of-the-art with a Dice Similarity Score of 87.2%, while being significantly efficient with a reduction of over 71% in terms of both parameters and FLOPs, compared to the best existing method in the literature. Code: https://github.com/Amshaker/unetr_plus_plus.
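The official implementation is available at the repository linked above. The following is only a rough, hypothetical PyTorch sketch of a paired-attention-style block: a shared query/key projection feeds a spatial branch whose keys and values are projected down to a fixed number of tokens (keeping the cost linear in the sequence length) and a channel branch that attends over feature channels. The class name `PairedAttention`, the projections, and all dimensions are assumptions, not the actual EPA code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairedAttention(nn.Module):
    """Illustrative paired spatial/channel attention sketch (hypothetical)."""

    def __init__(self, dim: int, num_tokens: int, proj_tokens: int = 64):
        super().__init__()
        self.q = nn.Linear(dim, dim)          # shared query projection
        self.k = nn.Linear(dim, dim)          # shared key projection
        self.v_spatial = nn.Linear(dim, dim)  # branch-specific values
        self.v_channel = nn.Linear(dim, dim)
        # learned projections shrinking the token axis from num_tokens to proj_tokens
        self.k_proj = nn.Linear(num_tokens, proj_tokens)
        self.v_proj = nn.Linear(num_tokens, proj_tokens)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, x):                                              # x: (B, N, C)
        q, k = self.q(x), self.k(x)
        # spatial branch: attention map is (B, N, p), linear in N
        k_p = self.k_proj(k.transpose(1, 2)).transpose(1, 2)           # (B, p, C)
        v_p = self.v_proj(self.v_spatial(x).transpose(1, 2)).transpose(1, 2)
        attn_s = F.softmax(q @ k_p.transpose(1, 2) / k_p.size(-1) ** 0.5, dim=-1)
        spatial = attn_s @ v_p                                         # (B, N, C)
        # channel branch: attention over the C feature channels, map is (B, C, C)
        attn_c = F.softmax(q.transpose(1, 2) @ k / q.size(1) ** 0.5, dim=-1)
        channel = (attn_c @ self.v_channel(x).transpose(1, 2)).transpose(1, 2)
        return self.out(torch.cat([spatial, channel], dim=-1))

# Example usage (shapes only): a 4096-token volume with 128-dim features
# block = PairedAttention(dim=128, num_tokens=4096)
# y = block(torch.randn(2, 4096, 128))   # -> (2, 4096, 128)
```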
Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with data corresponding to 118 individual speakers and 6 train splits per speaker of different sizes. Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient. We adapt NetAug, a general-purpose training algorithm, for ASR and create a novel Conformer variant called the DisConformer (Disentangled Conformer). This algorithm produces ASR models consisting of a frozen 'core' network for general-purpose use and several tunable 'augment' networks for speaker-specific tuning. Using such models, we propose a novel compute-efficient continual learning algorithm called DisentangledCL. Our experiments show that the DisConformer models significantly outperform baselines on general-purpose ASR, i.e., LibriSpeech (15.58% rel. WER on test-other). On speaker-specific LibriContinual, they significantly outperform trainable-parameter-matched baselines (by 20.65% rel. WER on test) and even match fully finetuned baselines in some settings.
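A minimal sketch of the frozen-core / tunable-augment idea, assuming a simple residual adapter-style split. The names `CoreWithAugment` and `freeze_core`, the placement of the augment path, and its width are hypothetical and not the DisConformer architecture or the DisentangledCL algorithm.

```python
import torch
import torch.nn as nn

class CoreWithAugment(nn.Module):
    """A shared, general-purpose 'core' layer plus a small tunable 'augment' path."""

    def __init__(self, dim: int, augment_dim: int = 64):
        super().__init__()
        self.core = nn.Linear(dim, dim)                  # general-purpose weights
        self.augment = nn.Sequential(                    # small speaker-specific weights
            nn.Linear(dim, augment_dim), nn.ReLU(), nn.Linear(augment_dim, dim)
        )

    def forward(self, x):
        return self.core(x) + self.augment(x)

def freeze_core(model: nn.Module):
    """Freeze core parameters; return the remaining (augment) parameters for the optimizer."""
    tunable = []
    for name, p in model.named_parameters():
        if name.startswith("core") or ".core" in name:
            p.requires_grad_(False)
        else:
            tunable.append(p)
    return tunable

# Speaker-specific tuning would then optimize only the augment parameters, e.g.:
# optimizer = torch.optim.Adam(freeze_core(model), lr=1e-4)
```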
End-to-end multilingual ASR has become more appealing for several reasons, such as simplifying the training and deployment process and positive performance transfer from high-resource to low-resource languages. However, scaling up the number of languages, total hours, and number of unique tokens is not a trivial task. This paper explores large-scale multilingual ASR models on 70 languages. We inspect two architectures: (1) a shared embedding and output model and (2) a multiple embedding and output model. In the shared-model experiments, we show the importance of the tokenization strategy across different languages. We then use our optimal tokenization strategy to train the multiple embedding and output model, further improving our results. Our multilingual ASR achieves 13.9%-15.6% average relative WER improvement compared to monolingual models. We show that our multilingual ASR generalizes well to an unseen dataset and domain, achieving 9.5% and 7.5% WER on Multilingual Librispeech (MLS) with zero-shot and finetuning, respectively.
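One way to picture the two architectures is the following hypothetical sketch, which contrasts a single shared output layer over a joint token set with per-language output layers on top of a shared encoder. The stand-in encoder, the class name `SharedOrPerLanguageHeads`, and all sizes are assumptions, not the paper's models.

```python
import torch
import torch.nn as nn

class SharedOrPerLanguageHeads(nn.Module):
    """Illustrative contrast: shared vs. per-language embedding/output layers."""

    def __init__(self, feat_dim=80, dim=256, per_lang_vocab=None, shared_vocab=None):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, dim, batch_first=True)  # stand-in acoustic encoder
        if shared_vocab is not None:
            # (1) shared output: one softmax over a joint multilingual token set
            self.heads = nn.ModuleDict({"shared": nn.Linear(dim, shared_vocab)})
        else:
            # (2) multiple outputs: one head per language
            self.heads = nn.ModuleDict({lang: nn.Linear(dim, v)
                                        for lang, v in per_lang_vocab.items()})

    def forward(self, feats, lang="shared"):     # feats: (B, T, feat_dim)
        enc, _ = self.encoder(feats)
        return self.heads[lang](enc)             # per-frame token logits
```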
Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance on a range of speech-processing tasks. This paper proposes a method to bias self-supervised learning towards a specific task. The core idea is to slightly finetune the model that is used to obtain the target sequence. This leads to better performance and a substantial increase in training speed. Furthermore, this paper proposes a variant of MPPT that allows low-footprint streaming models to be trained effectively by computing the MPPT loss on masked and unmasked frames. These approaches are evaluated for automatic speech recognition on the Librispeech corpus, where 100 hours of data served as the labelled data and 860 hours as the unlabelled data. The biased training outperforms the unbiased training by 15.5% after 250k updates and 23.8% after 100k updates on test-other. For the streaming models, the pre-training approach yields a reduction in word error rate of 44.1%.
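A minimal sketch of a masked-prediction loss that is also computed on unmasked frames, assuming frame-level discrete targets from the (slightly finetuned) target model and a simple additive weighting. The function name `mppt_style_loss` and the `unmasked_weight` knob are hypothetical, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mppt_style_loss(logits, targets, mask, unmasked_weight: float = 1.0):
    """Masked-prediction loss computed on both masked and unmasked frames.

    logits:  (B, T, V) frame-level predictions of the model being trained
    targets: (B, T)    discrete target indices produced by the target model
    mask:    (B, T)    boolean, True where the input frame was masked
    """
    per_frame = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    masked_loss = per_frame[mask].mean()
    unmasked_loss = per_frame[~mask].mean()
    return masked_loss + unmasked_weight * unmasked_loss
```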
Continual learning and few-shot learning are important frontiers in the pursuit of better machine learning. There is a growing body of work on each front, but combinations of the two are rare. Recently, however, Antoniou et al. (arXiv:2004.11967) introduced a continual few-shot learning framework, CFSL, that combines both. In this study, we extend CFSL to make it more comparable to standard continual learning experiments, which typically present many more classes. We also introduce an 'instance test' that requires classifying very similar specific instances, a capability of animal cognition that is usually ignored in ML. We selected representative baseline models from the original CFSL work and compared them to a model with hippocampus-inspired replay, as the hippocampus is considered crucial for this kind of learning in animals. As expected, learning more classes is more difficult than in the original CFSL experiments, and interestingly, the way in which they are presented affects performance. Accuracy on the instance test is comparable to that on the classification tasks. Consolidation using replay improves performance on both types of task, especially the instance test.
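As a tiny sketch of one way replay-based consolidation can be implemented, the following fixed-capacity buffer stores past examples and is sampled alongside new-task batches during training. The class `ReplayBuffer`, its capacity, and its eviction policy are hypothetical illustrations, not the hippocampus-inspired model compared in this study.

```python
import random

class ReplayBuffer:
    """Fixed-capacity store of past examples for replay-based consolidation."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.items = []

    def add(self, example):
        if len(self.items) >= self.capacity:
            # evict a random old example to make room (reservoir-style)
            self.items.pop(random.randrange(len(self.items)))
        self.items.append(example)

    def sample(self, k: int):
        # mix these replayed examples into the current task's training batches
        return random.sample(self.items, min(k, len(self.items)))
```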
Understanding human behaviour and monitoring mental health are essential for maintaining safe communities and societies. Because mental health problems increased during the pandemic, early detection of mental issues is crucial. Nowadays, the use of intelligent virtual personal assistants (IVAs) has grown worldwide; individuals use voice to control these devices, fulfil requests, and obtain different services. This paper proposes a novel deep learning model based on gated recurrent neural networks and convolutional neural networks to recognize human emotion from speech, in order to improve IVA services and monitor mental health.
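A minimal sketch of such a model, interpreting the gated recurrent network as a GRU stacked on top of a small 2D CNN over log-mel spectrograms. All layer sizes, the class name `CnnGruEmotionNet`, and the number of emotion classes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CnnGruEmotionNet(nn.Module):
    """Illustrative CNN + GRU model for speech emotion recognition."""

    def __init__(self, n_mels: int = 64, hidden: int = 128, n_classes: int = 4):
        super().__init__()
        self.cnn = nn.Sequential(                      # local spectro-temporal features
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), hidden,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):                           # spec: (B, 1, n_mels, T)
        f = self.cnn(spec)                             # (B, 64, n_mels//4, T//4)
        b, c, m, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * m) # frame-wise feature sequence
        _, h = self.gru(f)                             # final hidden states, (2, B, hidden)
        h = torch.cat([h[-2], h[-1]], dim=-1)          # concatenate both directions
        return self.classifier(h)                      # emotion class logits
```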
Action detection and public traffic safety are crucial aspects of a safe community and a better society. Monitoring traffic flow in smart cities using different surveillance cameras can play an important role in recognizing accidents and alerting first responders. The use of action recognition (AR) in computer vision tasks has contributed to high-precision applications in video surveillance, medical imaging, and digital signal processing. This paper presents an intensive review focusing on action recognition for accident detection and autonomous transportation systems in smart cities. We focus on AR systems that use a variety of traffic video capture sources, such as static surveillance cameras at traffic intersections, highway monitoring cameras, drone cameras, and dash-cams. Through this review, we identify the main techniques, taxonomies, and algorithms used in AR for autonomous transportation and accident detection. We also examine the datasets used in AR tasks and identify the main sources of datasets and dataset features. The paper provides potential research directions for developing and integrating accident detection systems for autonomous vehicles and public traffic safety systems by alerting emergency personnel and law enforcement in the event of road accidents, in order to minimize human error in accident reporting and provide a spontaneous response to victims.
Over the past decade, online education has become increasingly important in providing affordable, high-quality education to students around the world. This has been further amplified during the global pandemic as more and more students switch to online learning. Most online education tasks, such as course recommendation, exercise recommendation, or automated evaluation, depend on tracking students' knowledge progress, which is known in the literature as the knowledge tracing problem. Addressing this problem requires collecting student assessment data that reflects the evolution of their knowledge. In this paper, we present a new knowledge tracing dataset named Database Exercises for Knowledge Tracing (DBE-KT22), collected from an online student exercise system in a course taught at the Australian National University in Australia. We discuss the characteristics of the DBE-KT22 dataset and contrast it with existing datasets in the knowledge tracing literature. Our dataset is publicly available through the Australian Data Archive platform.